For this assignment we practice data transformation using a dataset of the daily prices and daily trading volumes of a group of large technology stocks that trade on US stock exchanges. Click here to download stocks.csv, which contains data going back to 2000. The dataset (additional github link here) contains several variables, including:
symbol: which is the ticker symbol for the stock
date: which is the trading date
open, high, low, and close, which are the price at the start of trading, the high price during the day, the low price during the day, and the stock price at the close of trading (unit is USD)
adjusted: which is the stock price at close adjusted for the financial effects of special events (such as dividends). Unit is USD
volume: which is the number of shares which traded during a given trading day.
All of the the stock prices and the trading volume have been adjusted for stock splits, so that the data provide a continuous record of how prices and trading volume changed.
There are several important functions and packages that you will need to use to complete this exercise.
We will use the TTR package (which is part of the tidyquant family of packages, see here). The main function we will use is called runMean. An alternative package that is also very nice but not part of the tidyverse is RcppRoll
If you aren’t very comfortable with logarithms, you should read more about them. They are one of the most important mathematical functions for data science. We aren’t using their mathematical properties much this week but they will be important throughout your data science journey. Khan Academy has a decent video, and this article in the journal nature has some more context.
We will calculate some correlation coefficients, using the cor function from base R (?cor to see how it is used). There is also a tidyverse package called corrr that is useful for calculating correlations on data frames, but we won’t use it for this lab.
The motivation for today’s assignment came from some news articles a few years ago about how big tech stocks collectively had a miniature meltdown after powering the stock market for several consecutive years, see this article at Morningstar
Problem 1: The price of a stock on a given day only conveys information in relation to the stock price on other days. One useful measure is the daily return of the stock, which we will define as the ratio of the adjusted closing price on the current day of trading to the adjusted closing price on the previous day of trading. Read the following article on window functions in dplyr: window functions in dplyr.
Daily Return: Find a function there that will help you calculate the daily return and use it along with mutate to add a return column to the data frame containing the daily return.
Hint: make sure to use group_by(symbol), otherwise your calculation might transpose prices from a different stock at the beginning of each time series.
Finding Special Events: Differences between the return and the return calculated based on the close price should indicate special corporate events such as dividends. Calculate the unadjusted return using the same technique you used to calculate the return, but replacing the adjusted variable with the close variable, and find the datapoint in the dataset where the return exceeded the unadjusted return by the greatest margin.
Hint to check you have done it right: it happened in November 2004. The reason that the close price and the adjusted price differ is because stock prices typically decrease when a dividend is paid (to account for the cash paid out). The adjusted value has been modified from the beginning of the initial data record to increase adjusted to compensate for dividends. A dividend is just a payment that a company makes periodically to those who hold stock.
If you are curious: Look for an old news article describing the significance of that event and tell me what happened
Problem 2: When working with stock price fluctuations or other processes where a quantity increases or decreases according to some multiplicative process like a growth rate (for example population growth) it is often better to work with the log of the growth rate rather than the growth rate itself. This allows standard summary statistics such as the mean to have a useful interpretation (otherwise you would have to use the geometric mean). Furthermore, the log transform is often useful to use on variables that are strictly positive, such as population growth rates or daily stock returns. To see why, consider a hypothetical stock which had a return of 0.5 (50% loss) on one day and 1.8 on the next day (80% gain). The mean of these two returns would be 1.075, or 7.5% per day. However, at the end of the two day period the stock would have lost 10% of its value (0.5*1.8 = 0.9). If we had computed the mean of the log(return) instead, we would have found that (log(0.5)+log(1.8))/2 = log(0.9^(1/2)), or approximately -5.2% per day, matching the observed price change.
Distribution of return Create a new variable called log_return which is the log of the return variable you calculated in the previous problem. Generate either a histogram or density plot of the distribution of log_return for the entire dataset. Then create a QQ-plot of log_return using geom_qq() and geom_qq_line(). What do you notice about the “tails” (right and left side/extreme edges) of the distribution from the QQ-plot? Are there visible signs of this in the density plot/histogram that you made?
Problem 3:Volume measures how many shares were traded of a given stock over a set time period, and high volume days often associate with important events or market dynamics.
Volume-Return Covariation: Create a new variable called log_volume which is the log(volume). Make a scatter plot of log_volume versus log_return, faceted by symbol to account for the fact that different stocks have different trading volumes. Do you see an association between log_volume and log_return in these scatter plots?
Volume-Return Correlation: Use the cor function to compute the pearson’s correlation coefficient between log_volume and log_return for each symbol. Why do you think the correlations are close to 0?
Hint: use summarize and don’t forget that cor is a base R function so you will either need to filter NA values for volume and log_return or appropriately choose the use flag in the argument- see ?cor for more info.
Transformed Correlations Next compute the correlation between abs(log_return) and log_volume using the absolute value function for each symbol. How have the correlations changed from the previous summary?
Problem 4: For this problem we will implement a more complicated mathematical transformation of data by calculating a measure of liquidity for each stock.
Liquidity measures the ability for a given asset to be bought or sold without a large impact on price. Liquid assets can be bought and sold quickly and easily (heavily traded stocks, currency), whereas illiquid assets (houses, art) have large increases or decreases in their price when someone tries to buy or sell them in large quantities. Liquidity is considered an important property of a well functioning financial market, and declines in liquidity have been blamed for worsening or triggering stock market crashes.
Many methods have been invented to measure liquidity, but for this problem we will focus on a method called “Kyle’s \(\lambda\)”. Kyle’s \(\lambda\) estimates liquidity by using a linear regression between the absolute value daily return of a stock and the logarithm of the dollar volume of that stock. The time periods used to estimate this regression can vary (each choice of time periods defines a unique measure of liquidity), and here we will use daily returns and a one month time period (defined as 20 trading days). You will learn a lot about linear models in DATA 606 and other classes, but to be complete, \(\lambda\) is a coefficient in the following linear model: \[
|R_t-1| = c + \lambda \log((\mathrm{Volume})_t (\mathrm{close})_t) + \epsilon_t
\] where the coefficients \(c\) and \(\lambda\) will be calculated to minimize the error \(\epsilon_t\) over the past 20 trading days.
\(\lambda\) stands for the amount that the stock price will move in units of basis points for a given \(\log\) dollar volume of trade. A small \(\lambda\) indicates high liquidity, and a high \(\lambda\) indicates low liquidity.
\(\lambda\) can be be calculated using rolling averages on the time series data with the TTR package, specifically the function runMean which when used within a dplyr pipeline will calculate the mean over the past \(n\) data points. For example, the command:
adds a new variable which is equal to the mean of the log_return over the past 20 days. The mathematical formula for \(\lambda\) is: \[
\lambda = \frac{\mathrm{mean}_{20}(R_a\log( p_c V ))
- \mathrm{mean}_{20}\left(R_a\right) \mathrm{mean}_{20}\left(\log\left(p_c V\right) \right) }
{\mathrm{mean}_{20}\left(\log\left( p_c V \right)^2\right)
-\mathrm{mean}_{20}\left(\log(p_c V)\right)^2 }
\] where to make the formula easier to read we have defined \(R_a = |\mathrm{return} -1|\) (note this is the regular return and not log_return), \(p_c = \mathrm{close}\) and \(V = \mathrm{volume}\), and the averages have been taken over the past 20 days of data.
Calculate Kyle’s \(\lambda\): Add a new variable called kyle to the data frame by implementing the above formula for \(\lambda\). Make sure to read and implement the formula very carefully, and to use the runMean function to calculate the rolling average correctly. I recommend writing a function which computes kyle from the variables close, return, and volume, and then using that function in mutate. You can start with the template code below (I would prefer you code this by hand):
kyle =function(return,close,volume){# put your code here}
Vibe Liquidity- LLM Prompting Exercise: LLMs can be powerful tools for code generation, but they can introduce unexpected bugs and generate code that is difficult to read, understand, and maintain. As such, it is important to verify their output and to be very specific about the style and features you want the output code to have. Without direction, LLMs will recapitulate the style of their training set. Recreate the kyle function above using an LLM of your choice by prompting it with instructions about the style you want and the mathematical formula. How does the output differ from the function you constructed in part (a)? Test the function by comparing it to kyle function you created earlier on the same output. Make sure to tell me the prompt that you used and the LLM as well.
Liquidity in Extreme Events: Add a new variable to the dataframe called extreme which is true when the log_return for a given stock is either greater than 99% of other values of the log_returnor less than 99% of all values of log_return. Use the percent_rankdplyr window function along with logical operators to create this variable. Then for each stock calculate the mean value of Kyle’s \(\lambda\) for the days when the log_return had extreme values and for when it didn’t (as identified by the extreme variable). What do your calculations and figures indicate about liquidity during extreme events?